Abstract. We introduce an algorithm for the segmentation of a class of regime switching processes. The 
segmentation algorithm is a non parametric statistical method able to identify the regimes (patches) of a 
time series. The process is composed of consecutive patches of variable length. In each patch the process is 
described by a stationary compound Poisson process, i.e. a Poisson process where each count is associated 
with a fluctuating signal. The parameters of the process are different in each patch and therefore the time 
series is non-stationary. Our method is a generalization of the algorithm introduced by Bernaola-Galvan, 
et al, Phys. Rev. Lett., 87, 168105 (2001). We show that the new algorithm outperforms the original one 
for regime switching models of compound Poisson processes. As an application we use the algorithm to 
segment the time series of the inventory of market members of the London Stock Exchange and we observe 
that our method finds almost three times more patches than the original one. 

PACS. 02.50.Ey Stochastic processes - 05.45.Tp Time series analysis - 89.65.Gh Economics; econophysics, 
financial markets, business and management 



1 Introduction 

Many time series from natural and social phenomena ex- 
hibit non-stationarity. A proper detection and character- 
ization of this non-stationarity is a major challenge in 
time series analysis. Among the possible types of non- 
stationarities, regime switching (or mosaic organization) 
plays a major role. In regime switching models the param- 
eters of the model change abruptly from time to time and 
the time series is organized in consecutive patches, each 
characterized by a distinct set of model parameters. There
is a vast literature on regime switching models [1], perhaps
the best known of which are Hidden Markov Models [2].
Most models describe processes in discrete time, although
there are also models in continuous time (i.e. point processes) [3].

Regime switching models have been applied to a large 
variety of systems. One of the first applications comes from 
research in quality control, where one wishes to detect deviations
from an expected output level of a production
process by observing a signal [4]. A more recent example
is human heartbeat interval dynamics. Studies showed
[5] that the time series of heartbeat intervals are organized
in consecutive temporal segments with different local
mean heart rates. Proper segmentation of the heartbeat
data can bring up relevant physiological information,
with the parameters differentiating between healthy and 
ill patients. Other examples include changes in economic 
regression models; financial data analysis; transport net- 
works on which the flow of some quantity can be studied 
(electric-energy networks, internet traffic); geological data 
and seismic signal processing (appearance of tsunamis and 
earthquakes); epidemiology; statistical image processing; 
and the appearance of shock wave fronts [6].

An example that we will consider in this paper is the 
inventory time series of financial market members. Mar- 
ket members trading a large amount of shares tend to 
split their order into smaller transactions in order to limit 
their own impact on the market [7,8,9]. This strategic order
splitting leads to long regimes when a market member
is consistently buying or selling. By studying the inventory
time series (the number of stocks owned at any given
moment by a market member) these regimes can be identified
and important information on traders' behavior can
be assessed.

Given an empirical time series assumed to be described 
by a regime switching process, the model fitting must be 
able to determine both the boundaries between consec- 
utive patches and the model parameters in each patch. 
Given the probabilistic nature of the model, the boundary 
between two consecutive regimes is not directly observable 
and must be inferred from the data. A segmentation algo- 
rithm is a statistical method able to identify the different 
regimes (patches) of a regime switching time series. 

Here we generalize an existing algorithm [5], originally
developed to identify non-stationarity in the heart rate, as 
a method for segmenting regime switching processes where 
each regime can be characterized by a compound Poisson 
process. A compound Poisson process is a point process 
characterized by a rate (like a Poisson process) and a sig- 
nal intensity distribution. In a normal Poisson process the 
signals are always of unit intensity. In a compound Poisson 
process the associated counting process (i.e. neglecting the 
intensity of the signals) is a Poisson process and the sig- 
nal intensities are independent and identically distributed 
random variables and are also independent of the counting
process. A compound Poisson process is stationary
and is the simplest example of a so-called marked point
process.^1 Physical examples include the celebrated Continuous
Time Random Walk [11], which is the integral of a
marked point process. Marked point processes have been
applied to many different fields, ranging from earthquakes
to financial time series [13,14,15].

In this paper we consider non stationary regime switch- 
ing stochastic processes in which in each patch the time 
series is described by a compound Poisson process. The pa- 
rameters of the process are different in different patches. 
The length of the patches where the process is stationary
is a random variable with a given distribution. Moreover,
the patch lengths are independent and identically distributed.
Note that the length of the patches is not necessarily exponentially
distributed, because the Poisson nature of the process holds
inside each patch and does not describe the boundaries
between different patches.

^1 A marked point process is a point process where both the
counting process (rate) and the signal intensity (jump size)
process are generic stochastic processes. Moreover, in general,
in a marked point process the counting process and the jump
process are not independent [10].

Regime switching models of compound Poisson processes
have several different applications. One case is the
disorder problem formulated by Kolmogorov [16]. In a disorder
problem one has to detect as quickly as possible a
change in the probabilistic properties of the observed process.
Natural applications of this problem are quality control
or any case in which an alarm has to be raised quickly.
The compound Poisson disorder problem has been widely
studied in the literature [17,18,19,20]. In these processes
either the arrival rate, or the jump distribution, or both
change abruptly at an unknown and unobservable time.
The type of process studied is the same as the one investigated
in our paper. The difference is that, while in
the disorder problem one investigates the process in real
time and tries to detect a regime shift quickly, in our case
we consider the whole time series and find the different
regimes ex post. Other typical applications of compound
Poisson processes and regime shifts are earthquakes [21],
meteorological data [22], and packets in Ethernet traffic
[23]. All these cases can be described by compound Poisson
processes, and the identification of regime shifts helps
in identifying systemic changes in the generating process.

The segmentation algorithm that we introduce here is
a generalization of the method introduced in [5]. This is a
top-down method, i.e. it first splits the whole time series
into two subsets and then continues iteratively by breaking
the series down into a more and more refined partition. One
of the advantages of this method and of our generalization
is that they are non parametric methods, i.e. they do not
postulate a known distribution of the patch length and of
the signal intensity (or jump size). The method of Ref.
[5] has recently been used in Refs. [8,24] for segmenting
financial time series of the market members' trading activity
in the Spanish Stock Exchange and in the London
Stock Exchange (LSE). In this paper we will also consider
the application of our algorithm to the segmentation of
inventory time series of LSE market members. In order to
do this in a financially reasonable way, we generalize the
null model by considering the possibility of long inactivity
periods between two consecutive patches.

The paper is organized as follows. In Section 2 we discuss
our null models of time series. In Section 3 we introduce
our segmentation methods and present simulation
results. We present empirical results for the inventory of
market members at the LSE in Section 4 and finish the
paper with conclusions in Section 5.

2 Null models 

We present two variants of the null model of time series 
that we expect to segment. The first is a pure regime 
switching model of compound Poisson processes, while in 
the second there may be long inactivity periods between 
consecutive active patches. This second model is designed
to better describe the financial time series investigated in
Section 4.

In the first model, patches with different Poisson
rates of activity follow each other. We consider the simplest
case in which the jump size can be ±1. Since jump
sizes are independent and identically distributed random
variables, the jump sizes in a patch are statistically
characterized by one probability (say, the probability of
a jump size equal to +1). The sign associated with the
largest probability is the dominant sign of the patch.

In the following we present results where the length of
the patches, $T_{act}$, is distributed according to a lognormal
distribution,

$$T_{act} = C_{act}\, e^{\mathcal{N}(1,1)}, \tag{1}$$

where $C_{act}$ determines the characteristic length of the patches
and $\mathcal{N}(1,1)$ is a normal distribution with mean $\mu = 1$ and
standard deviation $\sigma = 1$. The choice of a lognormal distribution is made in
order to have a fat tailed distribution of regime lengths, as
in many real world systems. However, the choice of a different
distribution for the patch lengths (e.g. an exponential
distribution) does not affect the quality of the results. The
mean and standard deviation of $T_{act}$ are

$$\langle T_{act} \rangle = e^{3/2}\, C_{act} \approx 4.48\, C_{act}, \tag{2}$$

$$\sigma_{T_{act}} = \sqrt{(e - 1)\, e^{3}}\; C_{act} \approx 5.87\, C_{act}. \tag{3}$$
In order to generate a realization of the process, for each
patch we choose a patch length $T_{act}$, a dominant sign (±1
randomly), a noise level $0 < \eta < 1$, and a rate $\alpha$. The dominant
sign and the noise level determine the jump process.
With probability $1 - \eta$ the jump has size equal to
the dominant sign and with probability $\eta$ it has the opposite
sign. The rate $\alpha$ determines the counting process,
i.e. the probability per unit time that an event occurs is $\alpha$.
This parameter is drawn in each patch from a uniform distribution
in the interval $[0.5 - \delta, 0.5 + \delta]$, where $0 < \delta < 0.5$
is the dispersion of the rate. For each patch we generate
a compound Poisson time series of length $T_{act}$ with the
chosen parameters.

Fig. 1. Snapshot of the integral of a simulated time series
of a regime switching model of compound Poisson processes
with inactivity periods, with parameters $C_{act} = 50$, $C_{inact} = 30$,
$\delta = 0.2$ and $\eta = 0$. The horizontal axis is global time.
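The generation procedure for the first null model can be sketched as follows. This is a minimal illustration under several assumptions of ours: the noise level $\eta$ is held fixed across patches, patch lengths are rounded to integer time steps, at most one event occurs per time step, and the function name `simulate_regime_switching` is hypothetical.

```python
import math
import random

def simulate_regime_switching(n_patches, c_act, delta, eta, seed=0):
    """Return (series, boundaries) for the first null model in global time.

    series[t] is 0 (no event) or +1/-1 (one event of that sign);
    boundaries[i] is the global-time index at which patch i ends.
    """
    rng = random.Random(seed)
    series, boundaries = [], []
    for _ in range(n_patches):
        # Patch length T_act = C_act * exp(N(1, 1)), at least one time step.
        length = max(1, round(c_act * math.exp(rng.gauss(1.0, 1.0))))
        dominant = rng.choice((-1, 1))
        rate = rng.uniform(0.5 - delta, 0.5 + delta)
        for _ in range(length):
            if rng.random() < rate:
                # With probability 1 - eta the jump has the dominant sign.
                sign = dominant if rng.random() >= eta else -dominant
                series.append(sign)
            else:
                series.append(0)
        boundaries.append(len(series))
    return series, boundaries

series, boundaries = simulate_regime_switching(100, c_act=50, delta=0.2, eta=0.0)
print(len(boundaries), boundaries[-1] == len(series))
```

The cumulative sum of `series` gives the integrated signal of the kind shown in Fig. 1, although that figure also includes inactivity periods, which this sketch omits.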

In the second type of null model we also add inactive
patches, i.e. patches in which no jumps are present, between
each pair of consecutive active patches. The active
patches are generated in the same manner as before. Inactive
patches have length $T_{inact}$ distributed according to
a lognormal distribution:

$$T_{inact} = C_{inact}\, e^{\mathcal{N}(1,1)}, \tag{4}$$

where, as before, $C_{inact}$ determines the characteristic length
of the patches. Again, choosing some other distribution
for the patch lengths does not affect the results. Figure
1 shows a snapshot of the integral of the time series. In
our simulations we varied the ratio
$\langle T_{inact} \rangle / \langle T_{act} \rangle = C_{inact}/C_{act}$
between the mean lengths of inactive and active
patches, sweeping the ratio from 0.02 (almost negligible
length of inactivity) to 1 (equal mean length of active and
inactive patches).

We simulate the process in discrete time with steps 
of unit time length. Moreover we consider two different 
times for the process. The first is the global time and is the 
(coarse grained) real time. The second is the local time in 
which we consider only the times when an event occurs. In 
other words, in local time we discard all the times when 
there are no events and end up with a time series con- 
taining only ±1. Local time discards all the information 
related to the point process nature of the time series and 
preserves only the jump (event) process. 
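The relation between the two clocks can be made concrete with a small sketch. The helper name `to_local_time` is ours; it also returns the global-time index of each event so that a cut found in local time can be mapped back to global time.

```python
def to_local_time(global_series):
    """Drop the time steps without events; keep each event's global index."""
    events = [(t, x) for t, x in enumerate(global_series) if x != 0]
    times = [t for t, _ in events]
    jumps = [x for _, x in events]
    return jumps, times

global_series = [0, 1, 0, 0, 1, -1, 0, 1]
jumps, times = to_local_time(global_series)
print(jumps)  # [1, 1, -1, 1]
print(times)  # [1, 4, 5, 7]
```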






Bence Toth et al.: Segmentation algorithm for non-stationary compound Poisson processes 



3 Segmentation method 

3.1 Segmenting regime switching models of compound 
Poisson processes 

As we have mentioned above, our segmentation algorithm
is based on the algorithm in Ref. [5]. This algorithm works
as follows. We move a sliding pointer over the series and
for each position of the pointer we measure the mean of the
signal in the subsets to the left and to the right of the
pointer. We compute the statistic

$$t = \left| \frac{\mu_{left} - \mu_{right}}{s_D} \right|, \tag{5}$$

where

$$s_D = \left[ \frac{s_{left}^2 + s_{right}^2}{N_{left} + N_{right} - 2} \right]^{1/2} \left( \frac{1}{N_{left}} + \frac{1}{N_{right}} \right)^{1/2} \tag{6}$$

is the pooled variance [25], $\mu_{left}$ and $\mu_{right}$ are the means of
the signal, $s_{left}$ and $s_{right}$ the standard deviations and $N_{left}$
and $N_{right}$ the numbers of data points to the left and right
of the pointer, respectively. We search for the position of
the pointer for which the t statistic of Eq. (5) is maximal
($t_{max}$). We make a cut if the significance of $t_{max}$ exceeds a
given threshold, which we set to 99% as in Ref. [5]. Note
that one has to modify the relation between the statistic
and the p-value according to [5]. After the cut we continue
the method recursively on the newly created subsequences,
until no further cut can be made. It is important to note
that before a new cut is accepted we compute the modified
t value between the new segments and their neighbors and
check if both values exceed the above significance level.
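A single step of this procedure, the search for the pointer position that maximizes the t statistic, might look like the sketch below. Two assumptions are ours: the quantities $s^2_{left}$ and $s^2_{right}$ in Eq. (6) are taken to be the sums of squared deviations on each side (the usual textbook form of the pooled variance), and the function name `best_cut` is hypothetical. The significance computation of Ref. [5] and the recursion are deliberately left out.

```python
import math

def best_cut(series, min_size=2):
    """Return (t_max, k) where the cut separates series[:k] from series[k:]."""
    n = len(series)
    total = sum(series)
    total_sq = sum(x * x for x in series)
    best_t, best_k = 0.0, None
    left_sum = left_sq = 0.0
    for k in range(1, n):
        x = series[k - 1]
        left_sum += x
        left_sq += x * x
        n_l, n_r = k, n - k
        if n_l < min_size or n_r < min_size:
            continue
        mu_l = left_sum / n_l
        mu_r = (total - left_sum) / n_r
        # Sums of squared deviations on each side (assumed meaning of s^2).
        ss_l = left_sq - n_l * mu_l ** 2
        ss_r = (total_sq - left_sq) - n_r * mu_r ** 2
        pooled = max(ss_l + ss_r, 0.0) / (n_l + n_r - 2)
        s_d = math.sqrt(pooled * (1.0 / n_l + 1.0 / n_r))
        if s_d > 0.0:
            t = abs(mu_l - mu_r) / s_d
        else:
            # Both sides are constant: a perfect split if the means differ.
            t = math.inf if mu_l != mu_r else 0.0
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

t_max, k = best_cut([0] * 10 + [1] * 10)
print(k)  # 10
```

Running sums are used so that scanning all pointer positions costs linear time in the length of the series.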

One direct method for segmenting a compound Poisson 
process is to consider the time series in local time and run 
the algorithm of Ref. [5] . We will refer to this method as 
the local time t-test throughout the paper. This approach 
has been used in Ref. [8] for segmenting inventory time
series in financial markets (see also Section 4). However
this approach is unable to identify changes in rate due to 
a regime shift because all the information on the rate is 
lost when one considers the compound Poisson process in 
local time. 

We introduce a new method for segmenting compound
Poisson processes, which we call the global time t-test. The
idea is to apply the algorithm to the time series in global
time, i.e. our series will be composed of many zeroes (in
the coarse grained time intervals when no event occurs)
and ±1 (when one event occurs). Naturally the power
of the segmentation method depends considerably on the
noise level $\eta$ and on the variance of the rate of the Poisson
processes inside different patches. In order to study this
we made simulations for different values of the dispersion
of the rate of the Poisson process in different patches. As
specified in the previous section, the rate is drawn from
a uniform distribution in the interval $[0.5 - \delta, 0.5 + \delta]$ and
we vary $\delta$. After generating the time series we run the local
time t-test and the global time t-test. We compare the
segmentation made by each method with the true segmentation.
All the results are the average of 100 simulation
runs and each run is made of 100 patches.

For the assessment of the segmentation methods we
compute the Jaccard index [26] between the detected segmentation
and the true segmentation. We denote the number
of point pairs that are in the same patch both in the
true time series and the segmented series by $M_{11}$. The
number of point pairs that are in the same patch in the
true time series but not in the segmented series is $M_{10}$,
and similarly the number of pairs in different patches in
the true time series but in the same patch in the segmented
series is $M_{01}$. The Jaccard index is defined as

$$J = \frac{M_{11}}{M_{10} + M_{01} + M_{11}}. \tag{7}$$

A perfect segmentation has $J = 1$. In our case there are
two possible ways of computing the Jaccard index. The
first is to compute the Jaccard index in local time, i.e. by
restricting the comparison to the time series where we have
removed all the zeroes. We denote this measure by $J_{Local}$.
The second way takes into account all time steps (even
when there are no counts) and we denote it by $J_{Global}$.
$J_{Local}$ tells us what fraction of the signals are segmented
well, while $J_{Global}$ tells us what fraction of global time is
segmented well.
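The pair counts in Eq. (7) need not be obtained by enumerating all pairs of points. The sketch below, with hypothetical helper names, derives them from the sizes of the patches and of their overlaps. Passing labels for every time step gives $J_{Global}$; passing labels only for the time steps with events gives $J_{Local}$. It assumes that at least one pair of points shares a patch in one of the two segmentations, since otherwise the denominator vanishes.

```python
from collections import Counter

def pairs(n):
    """Number of unordered pairs among n points."""
    return n * (n - 1) // 2

def labels_from_boundaries(boundaries):
    """Expand patch end indices, e.g. [3, 5], into labels [0, 0, 0, 1, 1]."""
    labels, start = [], 0
    for patch_id, end in enumerate(boundaries):
        labels.extend([patch_id] * (end - start))
        start = end
    return labels

def jaccard_index(true_labels, found_labels):
    """J = M11 / (M10 + M01 + M11), counted over pairs of points."""
    if len(true_labels) != len(found_labels):
        raise ValueError("both segmentations must cover the same points")
    joint = Counter(zip(true_labels, found_labels))
    m11 = sum(pairs(c) for c in joint.values())
    same_true = sum(pairs(c) for c in Counter(true_labels).values())
    same_found = sum(pairs(c) for c in Counter(found_labels).values())
    m10 = same_true - m11   # together in the truth, split by the method
    m01 = same_found - m11  # split in the truth, merged by the method
    return m11 / (m10 + m01 + m11)

true_labels = labels_from_boundaries([3, 6])
found_labels = labels_from_boundaries([2, 6])
print(jaccard_index(true_labels, found_labels))  # 4/9, about 0.444
```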

We studied both $J_{Local}$ and $J_{Global}$ for growing dispersion
of the Poisson rates and different noise levels. Figure
2 shows $J_{Local}$ for the local time t-test and the global time
t-test. We also included the results that come from an entirely
random segmentation of the time series, applying
the same number of cuts as the local time t-test. In this
figure we set $\eta = 0$. We can clearly see that for low values
of $\delta$ the two methods perform roughly the same, while
with growing $\delta$ the global time t-test outperforms the local
time t-test by a large margin. The same result is seen
in Figure 3, where we show $J_{Global}$.

Figure 4 shows the local Jaccard index $J_{Local}$ for a
model with $\delta = 0.5$ and for different values of $\eta$. Clearly,
$J_{Local}$ decreases as the noise level increases, but for all values
of $\eta < 0.4$ the global time t-test outperforms the local
time t-test segmentation algorithm, and for $0.4 < \eta < 0.5$
the two methods perform the same. A similar result is observed
also with the global Jaccard index, as can be seen
in Figure 5.

Fig. 2. (Color online) The local Jaccard index $J_{Local}$ as a
function of the dispersion $\delta$ of the rate $\alpha$ for the local time
t-test (black circles), and the global time t-test (blue stars).
The black dashed line shows the Jaccard index for random
segmentation. Here $\eta = 0$.

Fig. 3. (Color online) The global Jaccard index $J_{Global}$ as a
function of the dispersion $\delta$ of the rate $\alpha$ for the local time
t-test (black circles), and the global time t-test (blue stars).
The black dashed line shows the Jaccard index for random
segmentation. Here $\eta = 0$.

Fig. 4. (Color online) The local Jaccard index $J_{Local}$ as a
function of the noise $\eta$ for the local time t-test (black circles),
and the global time t-test (blue stars). Here $\delta = 0.5$.

Fig. 5. (Color online) The global Jaccard index $J_{Global}$ as a
function of the noise $\eta$ for the local time t-test (black circles),
and the global time t-test (blue stars). Here $\delta = 0.5$.



3.2 Segmenting regime switching models of compound 
Poisson processes with inactivity periods 

For regime switching models of compound Poisson processes
with inactivity periods we modify the above algorithm.
The segmentation algorithm, called the composite
test, has two modules. First we run the global time t-test
as described above. After the global time t-test, there is a
second module that we call the rate test: we check if the
patches found by the global time t-test are consistent with
the Poisson assumption. For each patch we estimate $\alpha$ as
the inverse of the average inter-event time in the patch.
Then for each patch, we search for the longest period of
inactivity between two events and we test whether it is
consistent with Poisson waiting times. Given a Poisson
process with rate $\alpha$, the probability of having at least one
waiting time longer than $W$ is

$$P(\omega_{max} > W) = 1 - \left(1 - e^{-\alpha W}\right)^{N}, \tag{8}$$

where $N$ is the number of jumps. Given the longest observed
waiting time $\omega_{obs}$, we make additional cuts if the
probability of having a waiting time $\omega_{max} > \omega_{obs}$ is less
than a given threshold $q$, i.e.,

$$\omega_{obs} > \frac{\log\left(1 - (1 - q)^{1/N}\right)}{-\alpha}. \tag{9}$$

In this paper we use $q = 0.01$. If Eq. (9) holds, we make two
additional cuts, one at the beginning and one at the end
of the inactivity period and then we continue the process
recursively on the new sequences.
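The rate test applied to a single patch can be sketched as below. The function names are ours, and we take $N$ to be the number of waiting times observed in the patch, which is an assumption about how the number of jumps is counted. The sketch returns the candidate inactivity period and leaves the recursion to the caller.

```python
import math

def waiting_time_threshold(alpha, n_jumps, q=0.01):
    """Right-hand side of Eq. (9): the W at which Eq. (8) equals q."""
    return math.log(1.0 - (1.0 - q) ** (1.0 / n_jumps)) / (-alpha)

def rate_test(event_times, q=0.01):
    """Return (start, end) of the longest gap if Eq. (9) holds, else None."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    if not gaps:
        return None
    # Rate estimated as the inverse of the average inter-event time.
    alpha = 1.0 / (sum(gaps) / len(gaps))
    i = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[i] > waiting_time_threshold(alpha, len(gaps), q):
        return event_times[i], event_times[i + 1]
    return None

# Two bursts of unit-spaced events separated by a long silence.
times = list(range(50)) + [300 + i for i in range(50)]
print(rate_test(times))  # (49, 300)
```

Note that Eq. (8) is written for exponential waiting times, whereas simulated data in discrete time has geometric waiting times; the sketch ignores this difference.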

To present the power of our composite test we show
how it works for simulated time series of regime switching
models of compound Poisson processes and inactivity
periods as described in the previous section. For activity
periods we draw the Poisson rate randomly from the interval
$\alpha \in [1/15, 1/5]$. We consider here the noiseless case
($\eta = 0$), but we observe similar results for all values of
$\eta$. After generating the time series we test the local time
t-test segmentation method and the two methods introduced
here, the global time t-test and the composite test.
We compare the segmentations obtained by each method
with the generated segments by computing the Jaccard index
in local time and in global time. As before, all results
are the average of 100 simulation runs.

Fig. 6. (Color online) The local Jaccard index $J_{Local}$ for the
local time t-test (black circles), the global time t-test (blue
stars) and the composite test (red diamonds). The black dashed
line shows the Jaccard index for random segmentation.

Fig. 7. (Color online) The global Jaccard index $J_{Global}$ for
the local time t-test (black circles), the global time t-test (blue
stars) and the composite test (red diamonds). The black dashed
line shows the Jaccard index for random segmentation.

Figure 6 shows the results for $J_{Local}$. We present the three
methods, the local time t-test (black circles), the global
time t-test (blue stars) and the composite test (red diamonds).
With the dashed line we also show the Jaccard
index obtained by an entirely random segmentation, applying
the same number of cuts as in the local time t-test.
It can be clearly seen that the curve of the composite
test is well above the other curves. For very short inactivity
periods the three methods perform roughly the same,
while for growing length of inactivity periods the performance
of the composite method (red diamonds) quickly
becomes better. The local time t-test (black circles) performs
uniformly when changing the ratio of active and
inactive patch length.

Figure 7 shows the results for $J_{Global}$. Again we see
that for very short inactivity periods the three methods perform
roughly the same, while for longer inactivity patches
the composite method (red diamonds) significantly outperforms
the other methods. The global time t-test performs
roughly the same as $C_{inact}/C_{act}$ changes, while the
performance of the local time t-test declines as the relative
length of inactivity periods increases.



From both plots we conclude that our proposed com- 
posite test outperforms the direct application of the orig- 
inal test to time series which can be described by regime 
switching models of compound Poisson processes (with 
or without inactivity periods). As already mentioned, the 
relative performance of the segmentation methods does 
not depend on the choice of the distribution of the patch 
lengths, $T_{act}$ and $T_{inact}$. The qualitative results do not change
when using other distributions. This in fact is one of the 
strengths of our method: being a non parametric method, 
it does not postulate a known distribution of the patch 
length and of the signal intensity. 

4 Empirical analysis of financial data 

As already mentioned in the introduction, a real example 
where the proposed segmentation methods can be very 
useful is the analysis of financial inventory time series and 
the determination of hidden orders. 

It has recently been shown that the signed market order
flow, i.e. the time series of the sign of the market orders
initiating trades, is a long memory process [27,28]. It has
been proposed that the long memory property is due to
the practice of order splitting [7]. When a large investor
decides to trade a large volume, it is quite unlikely that
she places one large order in the market. What typically
happens is that large trading orders are split into pieces
and executed incrementally. We call these large orders hidden
orders. For strategic reasons traders attempt to keep
the true size of their orders secret in order to minimize
transaction cost and trade the order at a more favorable
price (for a review of this problem, see [29]). Empirical
evidence of a widespread practice of order splitting has
been given in Refs. [30,31,32,8,24].
More recently it has been shown more directly that order 
splitting is the main cause of the long memory of order 
flow 0. 

From an empirical point of view the detection of hid- 
den orders is difficult because market participants do not 
reveal their trading intentions and strategy. It has been 
recently proposed to detect hidden orders from the analy- 
sis of the time series of the inventory of market members 
[8]. Inventory variation is the signed transaction volume of
a given market member, where conventionally a buy (sell)
transaction has a positive (negative) volume equal in absolute
value to the number of shares traded. Besides trading
for themselves, market members often act as brokers or
dealers. Therefore they are often not the real initiators
of hidden orders, but the assumption of this approach is 
that it is unlikely that more than one large hidden order 
is given simultaneously to one broker. Under this assump- 
tion a large hidden order should be visible in the inventory 
time series of the broker as a random walk with a very 
strong drift. 

This approach has proven useful and has been applied
to the determination of the market impact of hidden orders
[21]. The segmentation method used in [8,24] is the
local time t-test of [5] applied to the time series of inventory
variation of each broker in local time. Now, if two
consecutive buy hidden orders with the same typical vol- 
ume per trade (largely determined by the liquidity present 
in the market) are placed by the same broker with a time 
difference between the end of the first and the beginning of 
the second that is short compared to the length of the two 
hidden orders, a local time t-test segmentation method 
will identify them as a single hidden order, because the 
algorithm has no way of separating them. For this reason, 
the main claim of this paper is that a global time t-test 
segmentation method should work better in identifying 
hidden orders. 

One possible financial interpretation of the null model
with inactivity periods described in Section 3.2 is the following.
Brokers typically use for hidden order splitting a
VWAP (Volume Weighted Average Price) algorithm,
which is a simple trading protocol in which the trader
splits the order into pieces and trades it incrementally in order
to keep approximately constant the ratio $\alpha_v$ between
the volume traded by him and the total volume simultaneously
traded on that stock in the market. We assume
that a VWAP strategy is well represented by a Poisson
process with a rate related to the participation rate of
the order. The noise in the inventory takes into account
the fact that the broker is acting on behalf of many small
customers while working the large order. Finally, the inactivity
period takes into account the possibility that the
broker trades with a bursting activity corresponding to
hidden orders and separated by inactivity periods, as recently
shown empirically [9].

The database we study is the on-book (SETS) market
of the London Stock Exchange (LSE), for the period
of January 2002 through December 2004. The data set
contains all orders placed together with the participant
code of the market member placing them. A member can
both act as a broker, i.e., handling trades for other institutions
that are not members of the market, and
trade for its own account. Thus a single membership code
may lump together trades from many different institutions.
We call the segments identified by the statistical
methods patches. Patches with a well-defined direction
(mostly buying or mostly selling) can be understood as
hidden orders. In the following we present results from
empirical financial data applying both the local time t-test
(as in Vaglica et al. [8]) and our composite test for
segmenting the time series.

We study the inventory variation of a market member 
in global transaction time. This means that for each trans- 
action in which the studied participant is not involved we 
set his inventory variation to 0, and when he makes a 
transaction we put $+v$ for a buy trade and $-v$ for a sell
trade. Here v is the volume of the transaction in number 
of shares. In this way, the time series of the inventory vari- 
ation contains all information on the participation rate of 
the agent. 
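As an illustration of this construction, the sketch below builds the inventory variation series of one member from a chronological list of market transactions. The record layout (buyer code, seller code, volume) and the function name are hypothetical, since the text does not specify the data format, and treating a trade of the member with itself as zero variation is likewise our assumption.

```python
def inventory_variation(trades, member):
    """One entry per market transaction: +v for a buy, -v for a sell, else 0.

    `trades` is a hypothetical chronological list of
    (buyer_code, seller_code, volume) tuples.
    """
    series = []
    for buyer, seller, volume in trades:
        if buyer == member and seller != member:
            series.append(volume)
        elif seller == member and buyer != member:
            series.append(-volume)
        else:
            # Member not involved, or trading with itself (assumed neutral).
            series.append(0)
    return series

trades = [("A", "B", 100), ("C", "D", 50), ("B", "A", 30), ("A", "C", 20)]
print(inventory_variation(trades, "A"))  # [100, 0, -30, 20]
```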

We present here results on the data of AstraZeneca
(AZN), a highly liquid stock of the LSE. We study the 50
most active market members for this period. We consider
patches (hidden orders) with a well defined direction. For
this reason, as in Refs. [8,24], we apply some filters to
the patches found by the methods. Specifically, we define
directed patches as those where at least 75% of the volume
of transactions inside the patch have the same direction.
Furthermore we only study patches that consist of
at least 10 transactions. It is also important that in the case of
financial data, when applying the rate test, we only check
for inactivity periods that are at least 50 transactions long.
In real time, this corresponds on average to 20 minutes of
trading.

Table 1. Summary statistics of the local time t-test and the
composite test applied to financial data. The rows are the number
of directed patches found (num), the average number of
transactions made in a directed patch ($N$), the average length
of a directed patch in global transaction time ($T$), the average
participation rate in directed patches ($\alpha_v$), and the average
fraction of market orders in directed patches ($f_{mo}$).

measure       Local time t-test    Composite test
num           3702                 10613
$N$           171.6                64.6
$T$           1267                 286
$\alpha_v$    0.15                 0.23
$f_{mo}$      0.48                 0.46

Fig. 8. (Color online) Cumulative distribution of the patch
length for directed patches found by the local time t-test (black
circles) and the composite test (red diamonds). Panel title:
AZN; 2002-2004. The horizontal axis is $N$.
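The two filters that define a directed patch (at least 75% of the traded volume in one direction, and at least 10 transactions) can be written as a small predicate. This is a sketch with a hypothetical function name; it takes the signed inventory variations of one patch in global transaction time and counts only the member's own transactions.

```python
def is_directed_patch(variations, min_fraction=0.75, min_transactions=10):
    """True if the patch passes both filters used for directed patches."""
    own = [v for v in variations if v != 0]  # the member's own transactions
    if len(own) < min_transactions:
        return False
    buy_volume = sum(v for v in own if v > 0)
    sell_volume = -sum(v for v in own if v < 0)
    total = buy_volume + sell_volume
    return max(buy_volume, sell_volume) / total >= min_fraction

patch = [100] * 9 + [0] * 5 + [-100]
print(is_directed_patch(patch))  # True: 90% of the volume is buying
```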

As expected the composite test identifies more direc- 
tional patches than the local time t-test. In case of AZN 
the original algorithm determines 3702 directed patches, 
while the composite test finds 10613. This difference sug- 
gests that the local time t-test segmentation method tends 
to stick several patches together, not taking into account 
changes in the rate and the presence of inactivity periods. 
This problem is taken into account by our composite test. 
Thus we expect a better resolution of the patches and less 
intrinsic error in the detection. 

As characterizing statistics of the directed patches we
study the number of transactions made inside the patches,
$N$, the length of directed patches in global transaction
time, $T$, the participation rate in the patch, $\alpha_v$, and the
fraction of transactions made via market orders in the
patch ($f_{mo}$). In Table 1 we summarize some basic statistics
of the results, showing the number of directed patches
found with each method and the average of the above measures
computed over all directed patches found.



Figure 8 shows the cumulative distribution of the number
of transactions in the directed patches. The distribution
of patch lengths found using the composite test has a
smaller variance than that found with the local time t-test.
In both cases we find fat-tailed distributions. The average
number of transactions in a directed patch is 171.6 for the
local time t-test and 64.6 for the composite test. We find
a very large difference in the length of directed patches in
global transaction time (see Table 1): for the local time
t-test $T = 1267$ transactions, which is almost one trading
day, while for the composite test $T = 286$ transactions.

Figure 9 shows the distribution of the participation rate
for the directed patches found by the two methods. The
distribution for the patches found by the local time t-test
is peaked close to zero, with an average of 0.15, while the
distribution for those found by the composite test has its
maximum at a higher value and decays more slowly, with an
average of 0.23. A simple reason for this difference might
be that the composite test finds shorter patches on average,
and a high participation rate is more likely in a short patch
than in a long one (it is very hard to trade with a high
participation rate for long periods). Also, if the patches
detected with the local time t-test are composed of several
shorter patches detected with the composite test, interspersed
with inactivity periods, it is clear why the participation
rate is larger for the composite test than for the local time
t-test.



Figure 10 shows the distribution of $f_{mo}$ for the two
methods. Both distributions are roughly symmetric around
$f_{mo} = 0.5$. However, for the patches found by the local
time t-test the distribution has more weight in the middle
and lower values at the extremes. For the composite test the
weight in the middle is lower and there are large values for
the extreme cases, i.e. $f_{mo} = 0$ or $f_{mo} = 1$. For
$f_{mo} = 0$ we find the highest value of the distribution.
Again, these differences are probably due to the fact that
the composite test finds shorter patches: in a shorter patch
it is more probable that the participant sticks to one type
of order (in our case limit orders).

Fig. 9. (Color online) Distribution of the participation rate for
directed patches found by the local time t-test (black circles)
and the composite test (red diamonds).

Fig. 10. (Color online) The distribution of the ratio of market
orders for directed patches found by the local time t-test (black
circles) and the composite test (red diamonds).

4.1 Comparing the segmentations 

An important question is how the segmentations found by
the two methods differ. To understand whether the composite
test is a "refinement" of the local time t-test (finding
roughly the same patches and cutting them further into
pieces) or whether it produces an entirely different
segmentation, we study the distance of the cuts made by the
local time t-test from the nearest cut made by the composite
test. We find that this distance is on average one fourth of
the distance that would be found if the two sets of cuts were
made independently, and on average 90% of the distances are
lower than what we can expect in the independent case. This
means that a cut made by the local time t-test usually has a
cut made by the composite test nearby.

Fig. 11. (Color online) Cumulative distribution of the patch
lengths for directed patches found by the local time t-test
(black circles) and the rescaled cumulative distribution of the
patch lengths found by the composite test (red diamonds).
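
The comparison of cut positions can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions: cuts
are given as integer positions in transaction time, and the
"independent" reference is obtained by placing the second set of cuts
uniformly at random, which is one possible null model and not
necessarily the one used for the numbers quoted above. All function
names are hypothetical.

```python
import bisect
import random


def mean(values):
    return sum(values) / len(values)


def nearest_cut_distances(cuts_a, cuts_b):
    """For each cut in cuts_a, the distance to the nearest cut in cuts_b."""
    b = sorted(cuts_b)
    if not b:
        raise ValueError("cuts_b must contain at least one cut")
    distances = []
    for c in cuts_a:
        i = bisect.bisect_left(b, c)  # index of first cut in b at or after c
        candidates = []
        if i < len(b):
            candidates.append(b[i] - c)
        if i > 0:
            candidates.append(c - b[i - 1])
        distances.append(min(candidates))
    return distances


def independent_baseline(cuts_a, n_cuts_b, series_length, n_trials=200, seed=0):
    """Mean nearest-cut distance when the second set of cuts is random.

    Assumed null model: n_cuts_b distinct positions drawn uniformly
    from range(series_length), repeated n_trials times.
    """
    rng = random.Random(seed)
    trial_means = []
    for _ in range(n_trials):
        random_cuts = rng.sample(range(series_length), n_cuts_b)
        trial_means.append(mean(nearest_cut_distances(cuts_a, random_cuts)))
    return mean(trial_means)


# Toy example (made-up cut positions, not data from the paper).
local_cuts = [100, 400, 900]
composite_cuts = [50, 98, 250, 405, 600, 880, 950]
observed = mean(nearest_cut_distances(local_cuts, composite_cuts))
baseline = independent_baseline(local_cuts, len(composite_cuts), 1000)
ratio = observed / baseline  # values well below 1 indicate aligned cuts
```

A ratio well below one indicates that the cuts of the first
segmentation lie closer to cuts of the second one than random
placement would produce.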

We also measured the "nesting" of the two methods.
Specifically, we measure the fraction of patches found by
the composite test that are entirely contained in a patch
found by the local time t-test. For AZN we find a nesting
coefficient of 0.87 ± 0.09. This high value suggests that
the composite test can partially be seen as a refinement of
the local time t-test. In Figure 11 we plot the cumulative
distribution of $N$ for the two methods (similarly to
Figure 8), but rescaling the $N$ values of the composite
test by the ratio of the average values of $N$:
$171.6/64.6 \approx 2.65$ (see Table 1). The two
distributions are similar, suggesting again that the
composite test is a homogeneous refinement of the local
time t-test.
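
The nesting coefficient defined above is simple to compute once the
patches are available as intervals. The sketch below is illustrative
only: it assumes that each patch is a (start, end) pair in transaction
time with start < end, that the coarse patches do not overlap, and
that "entirely contained" allows shared boundaries. The function name
is hypothetical.

```python
import bisect


def nesting_coefficient(fine_patches, coarse_patches):
    """Fraction of fine patches lying entirely inside one coarse patch."""
    coarse = sorted(coarse_patches)
    starts = [start for start, _ in coarse]
    nested = 0
    for start, end in fine_patches:
        # Last coarse patch that starts at or before this fine patch.
        i = bisect.bisect_right(starts, start) - 1
        if i >= 0 and end <= coarse[i][1]:
            nested += 1
    return nested / len(fine_patches)


# Toy example: three of the four fine patches are nested.
coarse = [(0, 100), (150, 300)]
fine = [(0, 40), (40, 100), (90, 160), (200, 250)]
coefficient = nesting_coefficient(fine, coarse)  # 0.75
```

A value of one would mean that every patch of the finer segmentation
sits inside a single patch of the coarser one, which is what a pure
refinement would give.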

We study the similarity of the two segmentations by
computing the Jaccard index of the two patch structures.
For the 50 studied brokerage codes we find a global Jaccard
index of $J = 0.24 \pm 0.21$. Interestingly, this average is
not far from the value that we would expect if the composite
test were a homogeneous refinement of the local time t-test.
In fact the Jaccard index between a partition and its
refinement, where each patch in the first partition is
divided into $M$ patches of equal length, is $1/M$. In our
case, taking $M$ as the ratio of the average patch lengths,
this would lead to
$T_{composite} / T_{local\,time} = 286/1267 \approx 0.22$.
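
The 1/M property can be checked numerically. The sketch below assumes
the pair-counting form of the Jaccard index between two partitions
(pairs of points grouped together in both partitions, divided by pairs
grouped together in at least one); the text does not spell out which
definition was used, so this choice and the function names are
assumptions. Under this definition a partition with patches of length
L and its refinement into M equal parts give exactly
(L/M - 1)/(L - 1), which approaches 1/M for long patches.

```python
from collections import Counter
from math import comb


def labels_from_lengths(lengths):
    """Turn consecutive patch lengths into one patch label per point."""
    labels = []
    for patch_id, length in enumerate(lengths):
        labels.extend([patch_id] * length)
    return labels


def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions of the same points."""
    same_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    same_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    same_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    return same_both / (same_a + same_b - same_both)


# 10 coarse patches of length 1000, each split into M = 4 equal parts.
coarse_labels = labels_from_lengths([1000] * 10)
fine_labels = labels_from_lengths([250] * 40)
j = jaccard_index(coarse_labels, fine_labels)  # 249/999, close to 1/4
```

With the average lengths from Table 1 the same argument gives
M = 1267/286, i.e. an expected index of about 0.22 to 0.23 for a
homogeneous refinement.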

5 Conclusions 

We presented two related methods for segmenting time 
series that show non-stationarity in both time (rate) and 
intensity. Specifically, our segmentation method has been
developed and tested on regime switching models of com- 
pound Poisson processes. In this type of process the events 
occur as in a Poisson process and each event is associated 
with a signal described by an independent and identically 
distributed stochastic variable. Moreover, the time series 
is organized in patches where the parameters of the model 
are constant within each patch, while they are different 
in different patches. We have shown that our algorithms
perform quite well in segmenting the simulated time series
over a wide range of possible parameter distributions.
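
To make the class of processes concrete, the following Python sketch
generates a regime switching compound Poisson series. It is only an
illustration under stated assumptions: events inside a patch are
generated from exponential waiting times with a constant rate, and the
signal attached to each event is drawn from a Gaussian distribution.
The Gaussian choice, the parameter values and the function name are
assumptions made for the example, not the settings used in the
simulations discussed in this paper.

```python
import random


def simulate_patches(patches, seed=0):
    """Simulate a regime switching compound Poisson process.

    patches: list of (duration, rate, mark_mean, mark_std) tuples, one
    per patch, with rate > 0. Returns a list of (event_time, mark) pairs.
    """
    rng = random.Random(seed)
    events = []
    patch_start = 0.0
    for duration, rate, mark_mean, mark_std in patches:
        patch_end = patch_start + duration
        # Poisson events: exponential waiting times with the patch's rate.
        # Restarting the clock at the patch boundary is harmless because
        # the exponential distribution is memoryless.
        t = patch_start + rng.expovariate(rate)
        while t < patch_end:
            # Each event carries an i.i.d. signal (Gaussian by assumption).
            events.append((t, rng.gauss(mark_mean, mark_std)))
            t += rng.expovariate(rate)
        patch_start = patch_end
    return events


# Two patches: a fast one with positive marks, then a slow one with
# negative marks. The series as a whole is non-stationary.
series = simulate_patches([(1000.0, 2.0, 1.0, 0.5), (1000.0, 0.2, -1.0, 0.5)])
```

A segmentation method of the kind described here would be expected to
place a cut near time 1000, where both the event rate and the mean of
the signal change.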

Although the algorithm has been designed and tested for
regime switching models of compound Poisson processes, it
can be used for more general regime switching models of
marked point processes. This is due to the nonparametric
nature of the segmentation method. Therefore we expect
that the method performs quite well in segmenting regime
switching time series where, in each patch (regime), the
signal is a point process with an intensity associated with
each event. Examples of time series described by marked
point processes can be taken from a wide range of phenomena,
for example in physiological (neuron spikes), geophysical
(earthquakes), astrophysical (solar flares), and
socioeconomic (financial transactions) systems.

We also presented an application of our segmentation 
algorithm to financial time series. Specifically, we consider 
the segmentation of the time series describing the inven- 
tory of market members. The method looks for periods 
of time when a market member consistently buys or sells 
at a roughly constant rate. We postulate that these peri- 
ods correspond to hidden orders, i.e. large orders that are 
split and traded incrementally over a long period of time. 
Hidden orders are an important element for a proper char- 
acterization of order flow, which in turn is one of the main 
ingredients needed to understand the price formation pro- 
cess. Given the privacy issues related to the trading activ- 
ity of market investors, the detection of hidden orders at a 
financial market scale can be performed only with the aid 
of statistical methods. Some of the recent efforts in this 
direction include the methods in Refs. [8,24,34,55].

The method we introduced in this paper is the first to
consider the segmentation in global transaction time, i.e.
it is the first that takes into account the trading rate of
a given market member. By contrast, the other existing
methods work in local time, i.e. they discard all the
information related to the trading rate. When we compare
the statistics of the patches found by our algorithm with
those found by the algorithm of Refs. [5,8] we find almost
three times more patches. This suggests a better resolution
in the detection of hidden orders. Comparing the results of
the two methods, we find evidence that the segmentation of
the composite test can be understood as a refinement of the
patches found by the local time t-test.